TL;DR: we are announcing Wave 2 Call for Speakers for AIE World’s Fair this summer - apply here: https://sessionize.com/aiewf2026/ ESPECIALLY if you have projects relevant to our new tracks in Autoresearch, Memory, World Models, Tokenmaxxing, Agentic Commerce, and Vertical AI in Law, Healthcare, GTM and Finance!

In January we laid out plans for Scaling without Slop and despite some content exhaustion risk, your reception has been positive, with AIE viewership now trending to at least double 2025’s peak, serving over a million unique AI engineers a month.

This year is our first in Moscone West, doubling for the 3rd year in a row in our mission to bring all of the AI Engineering world to San Francisco to showcase the must-know research and product engineering work of the year, as well as to hire, fundraise, and close business deals. Sales are going well, but traditionally we do one callout a year for the World’s Fair to widen our net for people who might not traditionally think to submit a talk (because they didn’t know we were interested!).

This year we are adding an entire day’s worth of talks to the schedule, so on top of the all the evergreen themes we covered in 2025 and in Europe, we’re adding a few more new ones that I am specifically soliciting applications (and sponsors!) to cover:

Autoresearch: recursive self improvement loops in harnesses and model training!
Tasteful Tokenmaxxing: as a company leader, how do you make your AI Eng teams 10x more AI-Native/scale AI adoption, BUT without Goodharting waste?
Memory: how are your agents/models improving as your users use them?
World Models: how are you solving spatial intelligence and adversarial reasoning?
Agentic Commerce: how are agents paying for data, APIs, and other agents?
Vertical AI in Law, Healthcare, GTM and Finance: how are you applying AI in these specific domains? We are also open to submissions for AI in Government and AI in Education, though generally these seem less fast-moving.
Robotics: last year, Physical Intelligence, Waymo, Tesla, Nvidia, K-Scale (RIP) and others presented their approaches to autonomy; this year WE ARE ALLOCATING FREE EXPO FLOOR SPACE FOR GOOD ROBOTICS DEMOS. (contact hello@ai.engineer to set up your demo area! Humanoids must be accompanied.)
Founders: a new Startup Battlefield event will be added where you can pitch your pre-series A company to our panel of top VCs and guest judges.

There are other new tracks, which you can find in the full application form (don’t constrain yourself to tracks, just submit your best work and we’ll find a place for you)

If you already applied and were accepted in Wave 1, you should receive an email in your inbox informing you so - if not, don’t fret, you’ll still be considered in Wave 2, no further action needed.

This is for everyone else who weren’t aware we are soliciting applications for the biggest technical AI event of the year - especially if you know someone who would be PERFECT to talk about some of these topics we are calling out, then we need your help to reach them.

Apply here - and book your ticket/travel asap (because things are filling up fast for the World Cup also taking place in SF that week) — we will refund successful applicants. (Also contact hello@ai.engineer if you need an invitation letter for international visa).

AI News for 4/30/2026-5/1/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Grok 4.3’s Release, Benchmark Deltas, and the Open-vs-Closed Frontier

xAI shipped Grok 4.3 with materially better cost/performance, but mixed eval reception: Early chatter flagged an imminent API launch from @scaling01, followed by a detailed benchmark breakdown from Artificial Analysis. On their Intelligence Index, Grok 4.3 scores 53, up 4 points over Grok 4.20, with roughly 40% lower input and 60% lower output pricing. The biggest gain was on GDPval-AA, up 321 Elo to 1500, suggesting stronger real-world agentic task performance. It also hit 98% on τ²-Bench Telecom and held 81% on IFBench. The tradeoff: AA-Omniscience accuracy rose while non-hallucination dropped by 8 points, leaving concerns about reliability despite stronger capability. Arena has already added it across text, vision, document, and code modes via @arena.
Community reaction was split between “meaningful iteration” and “still behind top open models”: Several posts argued Grok is improving faster than critics admit, including @teortaxesTex, who noted token-efficiency gains as well, while others were more skeptical. @scaling01 claimed “Grok-4.3 still behind chinese open-source”, and Andon Labs reported a major regression on Vending-Bench 2, where Grok allegedly preferred to “sleep” rather than act. The more structural critique came from pricing and infra economics: @teortaxesTex argued Grok’s low prices may be subsidized by poor hardware utilization and that cache economics, not only model quality, increasingly determine agentic TCO.

DeepSeek V4 Pro, Vision/Spatial Reasoning, and Open-Weights Closing the Gap

DeepSeek V4 Pro appears to be the most credible open-weight coding/agent model in this batch: The strongest hands-on report came from @omarsar0, who tested DeepSeek-V4-Pro inside the Pi coding agent and described it as the first open-weight model that genuinely feels comparable to Codex or Claude Code for multi-turn agentic coding. Key systems details included 1M context, a hybrid CSA/HCA attention design, KV cache reduced to 10%, and nearly 4x lower inference FLOPs at long context. The report also emphasized practical harness fit: no custom setup, stable traces, and viable multi-step research/coding loops on Fireworks inference.
The broader benchmark picture confirms open weights are now much closer, though still behind on hardest tasks: Artificial Analysis noted that the three leading open-weight models released last week—Kimi K2.6, MiMo V2.5 Pro, and DeepSeek V4 Pro—now score 52–54 on the Intelligence Index, versus 57 for Gemini 3.1 Pro Preview and Claude Opus 4.7, and 60 for GPT-5.5. These top open models are all trillion-plus MoE systems with permissive licenses: Kimi at 1T/32B active, MiMo at 1T/42B active, and DeepSeek V4 Pro at 1.6T/49B active. The remaining gap is concentrated in HLE, CritPt, TerminalBench Hard, and hallucination-heavy Omniscience.
DeepSeek’s multimodal direction seems centered on explicit spatial grounding: Speculation about DeepSeek-Vision outperforming V4-Pro on ARC-AGI-2 because of actual spatial reasoning came from @teortaxesTex. A later summary of a briefly posted-and-deleted tech report from ZhihuFrontier described a multimodal CoT system that can “point while thinking” using boxes and points embedded directly into reasoning traces to reduce the “reference gap” in counting, maze solving, and path tracing. The stack reportedly uses DeepSeek-ViT, CSA compression, and V4-Flash (284B total / 13B active). Even if early tests still show weaknesses, it is a notable architectural bet: turning visual reasoning into explicit grounded computation rather than plain text description.

Codex’s Rapid Product Expansion vs Claude Code, Devin, and Other Agent Runtimes

Codex is winning on product velocity and UX polish, not just base model quality: A major theme across tweets was how quickly the Codex app is improving. High-engagement praise came from @gdb, @theo, and others comparing its feel favorably to alternatives. OpenAI added a device toolbar for responsive testing and improved browser-use speed by ~30% in “vibe testing,” per @JamesZmSun. It also added CI status in chat via @reach_vb, migration/import tooling for settings/plugins/agents via OpenAI, and a surprisingly viral pets system in Codex via @OpenAIDevs. While whimsical, the repeated point from users was that OpenAI is shipping a cohesive environment, not just a model endpoint.
Codex vs Claude Code is increasingly framed as UX + speed + taste tradeoffs: @theo summarized the current frontier coding vibe: GPT-5.5 is “smarter and can unblock you,” while Opus 4.7 has better intent/taste but can wander. In a second post, he argued Claude Code feels much slower on TTFT/TPS and requires more tool calls, while GPT/Codex feels more direct and economical for “fast mode” style use (tweet). Still, public benchmark comparisons are mixed: @scaling01 said GPT-5.5 did not beat Opus 4.7 on PostTrainBench in the Claude Code harness, highlighting how much results remain harness-dependent.
Other agent runtimes are converging on similar primitives: Devin launched “inside your shell” hotkey access via @cognition. Hermes added a /goal loop with a supervisor model forcing the agent to continue until completion, via @Teknium. Flue, introduced by @FredKSchott, positions itself as a TypeScript framework for headless autonomous agents, “like Claude Code but programmable.” The common pattern across these launches is that the competitive surface is moving from raw model IQ to agent harness design: subagents, browser-use, durable state, compaction, skills, and feedback loops.

Agent Infrastructure: Retrieval, Memory, HITL, and Durable Execution

The strongest research signal was that agent systems are bottlenecked by runtime design, not just model quality: Two especially useful papers were highlighted. First, ReaLM-Retrieve, summarized by @omarsar0, argues that reasoning models need retrieval during inference rather than only before it. It reports +10.1% absolute F1 over standard RAG and 47% fewer retrieval calls than fixed-interval IRCoT, with 3.2x lower per-retrieval overhead. Second, OCR-Memory, shared by @dair_ai, stores long-horizon trajectories as images with indexed anchors, retrieving exact prior content instead of lossy text summaries; it reports SOTA on Mind2Web and AppWorld under strict context limits.
LangChain/LangGraph pushed hard on production primitives for multi-user and human-in-the-loop agents: @sydneyrunkle outlined three concrete multi-user deployment concerns—data isolation, delegated credentials, and operator RBAC—and mapped each to LangSmith Agent Server features. Later posts covered a new HITL mode where a human reply can be returned directly as a tool result (tweet) and durable pause/resume semantics for consequential actions or unresolved judgment calls (tweet). This is a good snapshot of where real deployment complexity is moving: auth boundaries, persistent state, and explicit intervention points.
Durable execution is becoming a first-class runtime feature across stacks: Cloudflare announced Dynamic Workflows for adding durable execution to agent plans via @celso. LangChain positioned create_agent as the low-level primitive beneath Deep Agents, with extensibility for filesystems, bash, compaction, hooks, and subagents via @Vtrivedy10. The meta-point is consistent with one linked technical blog: the agent runtime itself—sandboxing, replay, checkpointing, orchestration—has become hidden technical debt and a major source of differentiation.

Research and Systems Papers Worth Bookmarking

Recursive / latent-space multi-agent coordination is emerging as a serious alternative to text-only agent chatter: @omarsar0 summarized Recursive Multi-Agent Systems, where agents communicate through shared latent recursive computation instead of full natural-language exchanges. Reported gains: 8.3% average accuracy improvement, 1.2x–2.4x end-to-end speedup, and 34.6%–75.6% token reduction across nine benchmarks. If agent-to-agent communication cost becomes dominant, this line of work matters.
Meta FAIR’s “self-improving pretraining” idea may be one of the more consequential training-time papers in the batch: @omarsar0 highlighted a method where a strong post-trained model rewrites pretraining suffixes toward safer, higher-quality continuations and then judges model rollouts during RL-style pretraining. Reported improvements include 36.2% relative gain in factuality, 18.5% in safety, and up to 86.3% win rate in generation quality over standard pretraining.
Microsoft’s synthetic long-horizon computer-use worlds look like a credible data recipe: @dair_ai described a system that creates 1,000 synthetic computers with realistic files and documents, then runs 8-hour agent simulations averaging 2,000+ turns. The thesis is straightforward and important: for computer-use agents, the bottleneck is no longer only model capability but scalable, realistic experiential data.

Top tweets (by engagement)

OpenAI/Codex momentum: OpenAI says GPT-5.5 is its strongest launch yet, with API revenue growing 2x faster than prior releases and Codex doubling revenue in under seven days.
Defense/government adoption: The U.S. “Department of War” CTO announced agreements with seven frontier AI and infrastructure companies to deploy capabilities on classified networks.
OpenAI messaging pivot on labor: Sam Altman: “we want to build tools to augment and elevate people, not entities to replace them”, with follow-up comments on jobs and future work here.
Codex adoption and delight: “codex app becoming incredible” from @gdb, plus Codex pets unexpectedly becoming one of the day’s biggest product-engagement hits.
Model benchmarking reality check: ARC Prize reports GPT-5.5 at 0.43% and Opus 4.7 at 0.18% on ARC-AGI-3, with analysis of failure modes.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen Model Developments and Benchmarks

PFlash: 10x prefill speedup over llama.cpp at 128K on a RTX 3090 (Activity: 339): The post introduces PFlash, a speculative prefill technique for long-context decoding on quantized 27B targets using C++/CUDA, achieving a 10x speedup over vanilla llama.cpp on an RTX 3090. This method leverages a small drafter model to score token importance, allowing the main model to focus only on significant spans, thus reducing prefill time significantly. The implementation combines insights from recent papers on speculative prefill and block-sparse attention, and is executed entirely in C++/CUDA without Python or PyTorch, making it efficient for consumer-grade GPUs like the RTX 3090. The repository is available on GitHub. Some commenters express skepticism about the claimed 10x speedup, with one noting the approach as potentially ‘super lossy’ due to its compression method. Another user reports out-of-memory issues on a 4090, indicating potential challenges in replicating the results.
- randomfoo2 highlights a novel approach in PFlash that involves using a smaller Qwen3-0.6B drafter to process the full 64K/128K prompt with FlashPrefill/BSA-style sparse attention, which reduces the computational cost. The drafter evaluates token/span importance, retaining only a crucial subset for the 27B target model to prefill, followed by speculative decoding using DFlash+DDTree on the compressed target KV. This method is noted for being ‘super lossy,’ indicating potential trade-offs in accuracy for speed.
- qwen_next_gguf_when raises concerns about the practicality of the PFlash method, noting that the DFlash component tends to run out of memory (OOM) on an RTX 4090. This suggests potential limitations in hardware compatibility or efficiency, which could impact the method’s replicability and scalability across different systems.
- Obvious-Ad-2454 expresses skepticism about the claimed 10x speedup, suggesting it might be too optimistic without independent verification. This comment underscores the importance of replication studies to validate performance claims in machine learning, especially when such significant improvements are reported.
Qwen 3.6 27B vs Gemma 4 31B - making Packman game! (Activity: 994): In a local LLM gamedev contest, Gemma 4 31B outperformed Qwen 3.6 27B in creating a Pac-Man style game on a MacBook Pro M5 Max with 64GB RAM. Gemma processed 27 tokens/sec and completed the task in 3m 51s with 6,209 tokens, while Qwen processed 32 tokens/sec over 18m 04s with 33,946 tokens. Despite Qwen’s more creative and visually styled output, Gemma’s solution was shorter, clearer, and more logical, excelling in game logic, interaction handling, and performance stability. The task required generating a complete HTML-based game with procedural graphics and no external libraries, focusing on smooth gameplay and stable performance using requestAnimationFrame and delta time for animations. Commenters noted the humor in the prompt’s demand for ‘no bugs’ and questioned the utility of vague prompts, suggesting they primarily test a model’s pre-existing knowledge rather than its problem-solving ability.
- Qwen 3.6 27B was tasked with creating a Pacman clone using a single HTML page and any libraries or graphics sources it deemed necessary. Interestingly, the model did not perform any external downloads or research, instead relying on its pre-existing knowledge to code the game. This highlights the model’s ability to generate functional code from minimal prompts, though it raises questions about the depth of its understanding and adaptability to new resources.
- A user pointed out that the ghost enemy movement in the Gemma 4 31B version of the Pacman game appears to be malfunctioning. This suggests potential issues with the model’s ability to accurately implement game logic, particularly in handling dynamic elements like enemy AI, which is crucial for a game like Pacman.
- The discussion raises concerns about the utility of using vague prompts for testing AI models, as noted by a commenter who described such prompts as “benchmaxxing tests.” This implies that the tests may not effectively evaluate the model’s problem-solving capabilities or its ability to adapt to new tasks, but rather assess its pre-existing knowledge base.
Qwen-Scope: Official Sparse Autoencoders (SAEs) for Qwen 3.5 models (Activity: 437): The Qwen Team has released Qwen-Scope, a set of Sparse Autoencoders (SAEs) for the Qwen 3.5 models, ranging from 2B to 35B MoE. This tool maps internal features across all layers, functioning as a dictionary of the model’s internal concepts, allowing for precise manipulation of features such as ‘legal talk’ or ‘Python code’. Key functionalities include Surgical Abliteration to suppress specific features, Feature Steering to activate desired concepts, Model Debugging to identify token-triggered directions, and Dataset Analysis to verify feature activation. The tool is released under the Apache 2.0 license but with a caution against removing safety filters. A practical example includes diagnosing unexpected language switches using a heatmap to identify over-activated features. More details can be found in the Qwen-Scope paper and the Hugging Face Space. Commenters highlight the significance of this release, noting it as potentially the largest open-source interpretability tool for dense models, surpassing Google’s GemmaScope in scale. There is anticipation for future iterations, such as Qwen 3.6, to incorporate similar tools.
- NandaVegg highlights the significance of the release of Sparse Autoencoders (SAEs) for the dense 27B Qwen model, noting it as potentially the largest open-source interpretability tool to date. This is in contrast to previous tools like GemmaScope, which only supported smaller models such as 9B and 2B, indicating a substantial advancement in model interpretability capabilities.
- robert896r1 expresses anticipation for the release of Qwen 3.6 or community-driven adaptations of the current tools for newer iterations. This reflects a common trend in the AI community where tools and models are rapidly iterated upon, and there is a need for compatibility with the latest versions to maintain relevance and utility.
- oxygen_addiction speculates on the use of feature steering in large AI models, such as ChatGPT5, suggesting that advanced routing mechanisms could be employed to select the most appropriate model for a given prompt. This points to a potential future where AI systems dynamically optimize their responses by leveraging multiple models and interpretability tools.
Qwen3.6-27B-Q6_K - images (Activity: 388): The post discusses the use of the Qwen3.6-27B-Q6_K model to generate SVG images based on creative prompts, such as a pelican riding a bicycle and a Victorian-era robot reading a newspaper. The model’s performance is measured in terms of time and throughput, with times ranging from 3min 10s to 8min 24s and throughput around 27 t/s. The images were generated using the Open Visual tool in Open WebUI (GitHub link). The post lacks specific hardware or framework details, which are crucial for evaluating the performance metrics provided. One commenter noted the absence of hardware and framework details, which are essential for interpreting the performance statistics. Another comment humorously appreciated the whimsical nature of the generated images, likening them to early 2000s email forwards.
- The user ‘ZealousidealBadger47’ reports a performance metric of 10.71 tokens per second for the Qwen 3.5 122b-a10b IQ4_XS model, which provides a benchmark for evaluating the model’s efficiency in processing data. This metric is crucial for understanding the model’s throughput and potential bottlenecks in real-time applications.
- ‘Ok-Importance-3529’ mentions the use of ‘Autoround quant’ with the Qwen3.6-27B-Q2_K_MIXED.gguf model, linking to a Hugging Face repository. This suggests an interest in model quantization techniques, which are essential for optimizing model performance and reducing computational load, especially in resource-constrained environments.
- ‘balerion20’ highlights the importance of providing hardware specifications, context size, and framework details when discussing model performance. This underscores the necessity of context in interpreting performance metrics, as these factors significantly influence the model’s speed and efficiency.
Devs using Qwen 27B seriously, what’s your take? (Activity: 785): Qwen 27B, a large language model, is being evaluated by developers for its coding capabilities, akin to Codex. Users report it as ‘solid’ but not consistently outperforming models like GPT-5.5. A user shared a GitHub commit showcasing Qwen 27B’s ability to refactor code effectively, though they wish for faster processing speeds (~120 tokens/second). Another user successfully runs Qwen 27B on llama.cpp with pi, noting it could substitute Claude Code if tasks are broken down and documentation access is provided to mitigate knowledge gaps. Some users feel Qwen 27B is ‘good enough’ for their needs, while others note it lacks a certain ‘extra something’ compared to other models. The need for task breakdown and documentation access is seen as both a limitation and a learning opportunity.
- Unlucky-Message8866 highlights the practical utility of Qwen 27B for code refactoring, specifically mentioning its ability to handle ESLint errors effectively. However, they express a desire for improved processing speed, ideally around 120 tokens per second.
- itroot discusses using Qwen 27B with llama.cpp and compares it to Claude Code, noting that while Qwen 27B requires more task breakdown and has knowledge gaps, it can perform similarly if supplemented with documentation access or cloud model assistance.
- formlessglowie shares a detailed experience of optimizing Qwen 27B’s performance using vLLM and MTP speculative decoding, achieving 50+ tokens per second with INT4 in a 262k FP8 context. They compare it favorably to past state-of-the-art models like Sonnet 3.7 and Gemini 2.5 Pro, emphasizing its modern capabilities despite not matching current top-tier models like GPT/Opus.
Qwen 3.6 35b a3b is INSANE even for VRAM-constrained systems (Activity: 574): The post discusses the performance of the Qwen 3.6 35B-A3B model on a VRAM-constrained system, highlighting its ability to handle complex coding tasks locally. The user, with a setup of AMD 7700 XT, 32GB DDR4 RAM, and Ryzen 5 5600, successfully ran the model using i1-q4_k_s quant, offloading all 40 layers to GPU, and configured 128k context with flash attention and Q8_0 KV quantization. The model effectively resolved complex bugs in a web scraper app and updated a project README with screenshots, outperforming previous models like Gemma 3, Gemma 4, and Qwen 2.5 Coder. This demonstrates the model’s capability to perform well even on hardware with limited resources, making local AI coding more practical. Commenters suggest optimizing performance by moving extra experts to CPU and fitting the KV cache on GPU to increase speed beyond 30 t/s. Another user notes achieving 35-40 tok/s with similar hardware, indicating potential for further optimization.
- GoldenX86 suggests optimizing performance by moving extra experts to the CPU while keeping the KV cache on the GPU, which can enhance speed to over 30 tokens/second. This approach leverages the CPU for less critical tasks, freeing up GPU resources for more intensive operations.
- AI_Enhancer discusses achieving 35-40 tokens/second processing speed, noting that prompt complexity significantly affects response time. They highlight that even with complex prompts, the model’s thinking time is capped at about 1 minute, suggesting efficient handling of difficult queries.
- cmplx17 shares a comparative analysis with Claude, noting that Qwen 3.6 exceeded expectations, especially in local model performance. This indicates significant advancements in model capabilities, making local models more competitive with cloud-based solutions.

2. Hardware and Infrastructure Setups

16x Spark Cluster (Build Update) (Activity: 1024): The image depicts a 16x Spark Cluster setup, which is part of a high-performance computing build using NVIDIA’s DGX Spark units. Each Spark runs on NVIDIA’s Ubuntu and connects to an FS N8510 switch via QSFP56 cables, achieving dual rail connectivity with up to 200 Gbps throughput. The setup is designed to maximize unified memory capacity, crucial for tasks like serving GLM-5.1-NVFP4 models. The cluster is intended for prefill tasks, with plans to integrate M5 Ultra Mac Studios for decode operations. The build emphasizes efficient memory use within the NVIDIA ecosystem, contrasting with alternatives like the RTX Pro 6000 Blackwell, which offers different trade-offs in terms of power and performance. One commenter suggests considering the RTX Pro 6000 Blackwell as an alternative, noting its potential for similar performance with possibly easier management and power considerations. Another commenter appreciates the build’s approach to addressing Mac prefill issues with a robust cluster setup.
- flobernd discusses the potential benefits of using 8x RTX Pro 6000 Blackwell GPUs instead of the current setup. They highlight that this alternative could offer a similar price point with the advantage of a single host configuration. Despite higher power usage, the RTX Pro 6000 Blackwell can efficiently run models like Kimi26 and GLM51-nvfp4 with excellent prefill and over 100 tokens per second, even with PCIe bottlenecks, which are also present in the current setup due to 200G NICs.
- TheRealSol4ra questions the choice of the current setup over using 8 RTX 6000 Pro GPUs, which provide 768GB of VRAM. They argue that this amount of VRAM is sufficient for running models at FP8 or Q6 precision, and while the current setup can run any model, it might be limited to 15-25 tokens per second, which is less efficient compared to the RTX 6000 Pro configuration.
AMD Halo Box (Ryzen 395 128GB) photos (Activity: 1033): The AMD Halo Box, featuring a Ryzen 395 processor and 128GB of RAM, was showcased running on Ubuntu. The unit includes a programmable light strip, enhancing its customization capabilities. However, it lacks a CD-ROM drive, which might be a consideration for some users. A notable comment highlights a desire for increased memory bandwidth in AMD products, suggesting that this is a recurring request among users.
- FoxiPanda highlights a critical performance aspect by suggesting that AMD should focus on increasing memory bandwidth. This is a significant factor in improving overall system performance, especially for high-demand applications that rely on rapid data access and processing.
- OnkelBB points out the lack of a fast port for clustering, which could limit the device’s utility in high-performance computing environments where multiple units are networked together to work on complex tasks. This could be a drawback for users looking to leverage the device in a clustered setup.

3. Other notable frontier-model / infra posts

Open Models - April 2026 - One of the best months of all time for Local LLMs? (Activity: 767): The image is a bar chart illustrating the parameter sizes of various local Large Language Models (LLMs) as of April 2026, highlighting a significant month for advancements in local LLMs. The chart features models like “DeepSeek-V4-Pro-Max” with 1600 billion parameters, and others like “Kimi-K2.6,” “MiMo-V2.5-Pro,” and “Ling-2.6-1T,” each with 1000 billion parameters. Notably, the “MiniMax-M2.7” model is absent from the graph due to a license change from MIT to Non-Commercial, indicating a shift in accessibility or usage rights. One commenter humorously notes running the 1600B model on a Raspberry Pi, highlighting the impracticality of such a large model on limited hardware. Another comment questions the feasibility of running “DeepSeek-V4-Pro-Max” locally, suggesting skepticism about its practical deployment in local environments.
- The mention of the 1600B model being run on a Raspberry Pi is technically intriguing, suggesting significant advancements in model efficiency and hardware compatibility. This implies that even large models can now be optimized to run on low-power devices, which could democratize access to powerful AI capabilities.
- The reference to Qwen3.5-122B-A10B suggests a discussion around a specific model variant, possibly highlighting its parameter size or architecture. This could indicate a trend towards more specialized or optimized models that balance size and performance for specific tasks or hardware configurations.
- The comment on parameter sizes being a ‘dumb’ metric reflects a technical debate on the relevance of parameter count as a measure of model capability. This suggests a shift towards evaluating models based on performance metrics like accuracy, efficiency, or real-world applicability rather than just size.
DeepSeek released ‘Thinking-with-Visual-Primitives’ framework (Activity: 345): DeepSeek, in collaboration with Peking University and Tsinghua University, has introduced a novel multimodal reasoning framework called ‘Thinking with Visual Primitives’. This framework elevates spatial tokens, such as coordinate points and bounding boxes, to serve as the “minimal units of thought” in the model’s chain-of-thought process. This approach allows the model to directly interleave these spatial tokens during reasoning, effectively enabling it to “point” to specific locations within an image while processing information. The framework was initially released on GitHub but was quickly made private, likely due to internal data or paths needing removal. GitHub Repository. Commenters noted that this approach could significantly enhance open models by enforcing spatial awareness and preventing attention drift, a common issue with complex images. There is anticipation for integrating this framework with models like Llama once the repository is available again.
- The ‘Thinking-with-Visual-Primitives’ framework by DeepSeek introduces a novel approach where models output raw bounding box coordinates as tokens, enhancing spatial awareness and reducing attention drift in complex images. This method contrasts with traditional natural language descriptions, which can be vague and lead to inaccuracies in spatial reasoning. The framework’s potential integration with models like Llama could significantly improve their performance once the code is publicly available again.
- DeepSeek’s release strategy involves initially making their repositories public and then quickly setting them to private, possibly to remove sensitive internal data. This approach allows them to bypass formal review processes while still gaining community attention and credit. The strategy also relies on the community to mirror and fork the repositories, ensuring the code remains accessible despite the temporary privacy.
- The framework’s concept aligns with existing efforts by companies like Google, which have explored similar ideas, though documentation and research on such methods have been sparse. The use of visual primitives for spatial reasoning could represent a significant advancement in open models, potentially influencing future developments in AI spatial awareness and reasoning capabilities.
Where the goblins came from (Activity: 359): The OpenAI article titled “Where the Goblins Came From” discusses the challenges and methodologies in training large-scale AI models, particularly focusing on the implications of embedding vast amounts of knowledge into model parameters. The discussion references Sutton’s Bitter Lesson, which emphasizes the superiority of scalable compute over hand-crafted algorithms. The article critiques the approach of embedding extensive prior knowledge into models, suggesting that this contradicts Sutton’s advice to focus on systems that discover patterns autonomously. The latest OpenAI model, estimated at 10 trillion parameters, is highlighted as an example of this approach, raising questions about the efficiency and necessity of such scale in AI training. The comments debate the interpretation of Sutton’s Bitter Lesson, with some arguing that OpenAI’s approach of embedding extensive knowledge into models contradicts Sutton’s emphasis on scalable compute for autonomous pattern discovery. Others suggest that alternative methods, such as knowledge graphs and reasoning engines, could avoid embedding unnecessary information like ‘goblins’ into models.
- Luke2642 discusses the misinterpretation of Sutton’s ‘bitter lesson’ in AI research, emphasizing that Sutton advocated for scaling compute to enable systems to discover patterns independently, rather than embedding extensive prior knowledge into models. This contrasts with the approach of large models like OpenAI’s, which use massive parameter counts (e.g., 10 trillion) to encode vast amounts of human knowledge, including trivial data like ‘goblins’. This approach is critiqued as inefficient compared to potentially more effective methods like knowledge graphs or reasoning engines.
- Luke2642 also highlights the efficiency of Chinese researchers in applying less compute to achieve similar or better results, suggesting they may have developed superior algorithms or architectures. This raises questions about the current trend of scaling parameters and data in AI models, suggesting that alternative methods could avoid the pitfalls of embedding unnecessary information, such as ‘goblins’, into AI systems.
“What do you guys even use local LLMs for?” Me: A lot (Activity: 469): The image is a dashboard from Grafana, displaying metrics related to the usage of local Large Language Models (LLMs) over a six-hour period. It tracks various statistics such as total tokens used, generation speed, and throughput, providing insights into the performance and utilization of different models and applications. The dashboard highlights that applications like “Hermes” and “Vane” have the highest usage counts, indicating their significant role in the user’s local LLM ecosystem. The user has implemented a system to log usage via Prometheus, which helps in monitoring and optimizing the performance of these models. One commenter notes that the token usage is substantial, but suggests that it would need to be in the billions to be considered ‘a lot.’ Another commenter discusses the cost-saving benefits of using local LLMs for initial code review, which reduces the need for expensive API calls.
- spencer_kw discusses using a local LLM, specifically ‘qwen’, for code review before sending code to an API model like ‘opus’. This approach catches about 60% of obvious mistakes, significantly reducing API usage and saving approximately $80/month in costs. This highlights the cost-effectiveness of local LLMs in pre-processing tasks before utilizing more expensive cloud-based models.
- CalligrapherFar7833 suggests using local LLMs for initial data filtering, such as detecting relevant frames before processing with a vision LLM. This strategy can optimize performance by reducing the amount of unnecessary data processed by more resource-intensive models, thereby improving efficiency and potentially lowering computational costs.
- Nyghtbynger emphasizes the importance of monitoring resource usage and costs when using local models. They find provider dashboards useful for tracking metrics like money spent and cache usage, which are critical for managing the efficiency and cost-effectiveness of local LLM deployments.

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo, /r/aivideo

1. AI Model Releases and Benchmarks

... [Content truncated due to size limits]

Read full article